Two different robots#
The code for this example is implemented in the different_robots module. Let us import it.
[1]:
from enki_env.examples import different_robots
Environment#
The environment contains one Thymio and one E-Puck. Otherwise it is very similar to the previous “same robots” example: same task, same reward, just different robots with (slightly in this case) different sensors.
To create the environment via script, run:
python -m enki_env.examples.different_robots.environment
[2]:
env = different_robots.make_env(render_mode="human")
env.reset()
env.snapshot()
The robots belong to different groups, each with its own observation space:
[3]:
env.group_map
[3]:
{'thymio': ['thymio_0'], 'e-puck': ['e-puck_0']}
[4]:
env.observation_spaces
[4]:
{'thymio_0': Dict('wheel_speeds': Box(-1.0, 1.0, (2,), float32), 'prox/value': Box(0.0, 1.0, (7,), float32)),
'e-puck_0': Dict('wheel_speeds': Box(-1.0, 1.0, (2,), float32), 'prox/value': Box(0.0, 1.0, (8,), float32))}
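Because the two robots expose a different number of proximity sensors (7 for the Thymio, 8 for the E-Puck), a single policy cannot serve both groups: its input dimension must match the sensor count. A minimal NumPy sketch illustrates this; the arrays below are stand-ins for real observations, with shapes taken from the spaces above.

```python
import numpy as np

# Stand-ins for one observation per robot (shapes from the spaces above).
thymio_obs = {"wheel_speeds": np.zeros(2, np.float32), "prox/value": np.zeros(7, np.float32)}
epuck_obs = {"wheel_speeds": np.zeros(2, np.float32), "prox/value": np.zeros(8, np.float32)}

# A linear policy whose weight matrix is sized for the Thymio...
w = np.zeros((2, 7), np.float32)
print(w @ thymio_obs["prox/value"])   # works: output of shape (2,)
try:
    w @ epuck_obs["prox/value"]       # fails: input dimension 8 != 7
except ValueError as e:
    print("shape mismatch:", e)
```

This is why the example keeps one baseline (and later one RL policy) per group rather than sharing a single policy across robots.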
Baseline#
We adapted the Thymio baseline to work for the E-Puck.
To evaluate the performance of both baselines via script, run:
python -m enki_env.examples.different_robots.baseline
[5]:
import inspect
print(inspect.getsource(different_robots.EPuckBaseline.predict))
def predict(self,
            observation: Observation,
            state: State | None = None,
            episode_start: EpisodeStart | None = None,
            deterministic: bool = False) -> tuple[Action, State | None]:
    prox = np.atleast_2d(observation['prox/value'])
    m = np.max(prox, axis=-1)
    prox[m > 0] /= m[:, np.newaxis][m > 0]
    ws = np.array([(-0.1, -0.25, -0.5, -1, -1, 0.5, 0.25, 0.1)], dtype=np.float32)
    w = np.tensordot(prox, ws, axes=([1], [1]))
    w[m == 0] = 1
    return np.clip(w, -1, 1), None
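The heuristic is a Braitenberg-style weighted sum: the readings are first normalized so the strongest one equals 1, then combined with asymmetric weights so the output steers the robot away from the side that senses an obstacle; when nothing is sensed (all readings zero) the command defaults to 1. The same arithmetic can be re-run standalone to check this; the sample readings below are made up for illustration.

```python
import numpy as np

def epuck_weighting(prox):
    """Same arithmetic as EPuckBaseline.predict, on a batch of readings."""
    prox = np.atleast_2d(np.asarray(prox, dtype=np.float32)).copy()
    m = np.max(prox, axis=-1)
    prox[m > 0] /= m[:, np.newaxis][m > 0]  # normalize strongest reading to 1
    ws = np.array([(-0.1, -0.25, -0.5, -1, -1, 0.5, 0.25, 0.1)], dtype=np.float32)
    w = np.tensordot(prox, ws, axes=([1], [1]))
    w[m == 0] = 1  # nothing sensed: default command
    return np.clip(w, -1, 1)

print(epuck_weighting(np.zeros(8)))               # [[1.]]  nothing sensed
print(epuck_weighting([0, 0, 0, 1, 0, 0, 0, 0]))  # [[-1.]] sensor 3 dominates
```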
To perform a rollout, we assign one policy to each group.
[6]:
rollout = env.unwrapped.rollout(max_steps=10, policies={'thymio': different_robots.ThymioBaseline(),
'e-puck': different_robots.EPuckBaseline()})
[7]:
rollout.keys()
[7]:
dict_keys(['thymio', 'e-puck'])
[8]:
rollout['thymio'].episode_reward, rollout['e-puck'].episode_reward
[8]:
(np.float64(-2.969223195758214), np.float64(-13.845665476374068))
Reinforcement Learning#
Let us now train and evaluate two RL policies for this task, one for each robot.
To perform this via script, run:
python -m enki_env.examples.different_robots.rl
[9]:
policies = different_robots.get_policies()
[10]:
policies.keys()
[10]:
dict_keys(['thymio', 'e-puck'])
[11]:
rollout = env.unwrapped.rollout(max_steps=10, policies=policies)
rollout['thymio'].episode_reward, rollout['e-puck'].episode_reward
[11]:
(np.float64(-3.5418375520177077), np.float64(-15.970891166964257))
Video#
To conclude, to generate a video like the one in the previous example, you can run
python -m enki_env.examples.different_robots.video
or run
[12]:
video = different_robots.make_video()
video.display_in_notebook(fps=30, width=640, rd_kwargs=dict(logger=None))
[12]: